In the 21st century, cars are an important mode of transportation that gives us personal control and autonomy. In day-to-day life, people use cars for commuting to work, shopping, visiting family and friends, and so on. Research shows that more than 76% of people avoid traveling somewhere if they don't have a car. Most people buy different types of cars based on their day-to-day necessities and preferences, so it is essential for automobile companies to analyze the preferences of their customers before launching a car model into the market. Austo, a UK-based automobile company, aspires to grow its business into the US market after successfully establishing its footprint in the European market.
To become familiar with the types of cars customers prefer and the factors influencing car purchase behavior in the US market, Austo has contracted a consulting firm. Based on various market surveys, the consulting firm has created a dataset of the 3 major types of cars that are extensively used across the US market. It has collected various details about the car owners, which can be analyzed to understand the US automobile market.
Objective
Austo’s management team wants to understand buyer demand and trends in the US market. They want to build customer profiles based on the analysis to identify new purchase opportunities, so that they can adjust the business strategy and production to meet certain demand levels. Further, the analysis will be a good way for management to understand the dynamics of a new market. Suppose you are a Data Scientist working at the consulting firm that has been contracted by Austo. You are given the task of creating buyer profiles for different types of cars with the available data, as well as a set of recommendations for Austo. Perform the data analysis to generate useful insights that will help the automobile company grow its business.
Data Description
austo_automobile.csv: The dataset contains buyers' data corresponding to different types of products (cars).
Data Dictionary
Age: Age of the customer
Gender: Gender of the customer
Profession: Indicates whether the customer is a salaried or business person
Marital_status: Marital status of the customer
Education: Refers to the highest level of education completed by the customer
No_of_Dependents: Number of dependents (partner/children/spouse) of the customer
Personal_loan: Indicates whether the customer availed a personal loan or not
House_loan: Indicates whether the customer availed a house loan or not
Partner_working: Indicates whether the customer's partner is working or not
Salary: Annual salary of the customer
Partner_salary: Annual salary of the customer's partner
Total_salary: Annual household income (Salary + Partner_salary) of the customer's family
Price: Price of the car
Make: Car type (Hatchback/Sedan/SUV)
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv("Downloads/download-3.csv")
df = data.copy()
df.head()
| | Age | Gender | Profession | Marital_status | Education | No_of_Dependents | Personal_loan | House_loan | Partner_working | Salary | Partner_salary | Total_salary | Price | Make |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Male | Salaried | Married | Post Graduate | 4 | No | Yes | Yes | 52000 | 25000 | 77000 | 18000 | Hatchback |
| 1 | 28 | Male | Salaried | Married | Post Graduate | 3 | No | Yes | No | 68000 | 0 | 68000 | 18000 | Hatchback |
| 2 | 23 | Male | Salaried | Married | Graduate | 4 | Yes | Yes | No | 31000 | 0 | 31000 | 18000 | Hatchback |
| 3 | 26 | Male | Business | Married | Post Graduate | 4 | Yes | Yes | Yes | 66000 | 35000 | 101000 | 18000 | Hatchback |
| 4 | 28 | Male | Business | Married | Post Graduate | 4 | Yes | No | No | 64000 | 0 | 64000 | 18000 | Hatchback |
df.head(50)
| | Age | Gender | Profession | Marital_status | Education | No_of_Dependents | Personal_loan | House_loan | Partner_working | Salary | Partner_salary | Total_salary | Price | Make |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Male | Salaried | Married | Post Graduate | 4 | No | Yes | Yes | 52000 | 25000 | 77000 | 18000 | Hatchback |
| 1 | 28 | Male | Salaried | Married | Post Graduate | 3 | No | Yes | No | 68000 | 0 | 68000 | 18000 | Hatchback |
| 2 | 23 | Male | Salaried | Married | Graduate | 4 | Yes | Yes | No | 31000 | 0 | 31000 | 18000 | Hatchback |
| 3 | 26 | Male | Business | Married | Post Graduate | 4 | Yes | Yes | Yes | 66000 | 35000 | 101000 | 18000 | Hatchback |
| 4 | 28 | Male | Business | Married | Post Graduate | 4 | Yes | No | No | 64000 | 0 | 64000 | 18000 | Hatchback |
| 5 | 36 | Male | Salaried | Married | Post Graduate | 3 | Yes | No | No | 66000 | 0 | 66000 | 49000 | SUV |
| 6 | 28 | Male | Business | Married | Graduate | 4 | Yes | No | Yes | 48000 | 40000 | 88000 | 18000 | Hatchback |
| 7 | 27 | Male | Salaried | Married | Post Graduate | 3 | No | No | Yes | 76000 | 35000 | 111000 | 18000 | Hatchback |
| 8 | 46 | Female | Salaried | Single | Post Graduate | 0 | Yes | No | No | 80000 | 0 | 80000 | 49000 | SUV |
| 9 | 24 | Male | Salaried | Single | Graduate | 2 | No | Yes | No | 30000 | 0 | 30000 | 18000 | Hatchback |
| 10 | 46 | Female | Salaried | Married | Graduate | 4 | No | No | No | 70000 | 0 | 70000 | 49000 | SUV |
| 11 | 26 | Male | Business | Married | Post Graduate | 4 | No | No | No | 67000 | 0 | 67000 | 18000 | Hatchback |
| 12 | 23 | Male | Salaried | Married | Post Graduate | 3 | Yes | No | No | 52000 | 0 | 52000 | 18000 | Hatchback |
| 13 | 44 | Female | Business | Single | Post Graduate | 2 | No | No | No | 84000 | 0 | 84000 | 31000 | Sedan |
| 14 | 35 | Male | Business | Married | Post Graduate | 2 | Yes | Yes | Yes | 61000 | 35000 | 96000 | 31000 | Sedan |
| 15 | 29 | Male | Salaried | Married | Post Graduate | 4 | No | No | No | 76000 | 0 | 76000 | 18000 | Hatchback |
| 16 | 32 | Male | Salaried | Married | Post Graduate | 1 | Yes | No | Yes | 56000 | 35000 | 91000 | 31000 | Sedan |
| 17 | 23 | Male | Salaried | Married | Post Graduate | 3 | Yes | Yes | No | 58000 | 0 | 58000 | 18000 | Hatchback |
| 18 | 30 | Male | Business | Married | Post Graduate | 4 | No | No | Yes | 67000 | 22000 | 89000 | 18000 | Hatchback |
| 19 | 27 | Male | Business | Married | Post Graduate | 3 | Yes | No | Yes | 65000 | 35000 | 100000 | 18000 | Hatchback |
| 20 | 39 | Female | Business | Married | Post Graduate | 4 | No | No | Yes | 72000 | 50000 | 122000 | 49000 | SUV |
| 21 | 22 | Male | Salaried | Married | Post Graduate | 3 | Yes | No | No | 53000 | 0 | 53000 | 18000 | Hatchback |
| 22 | 40 | Male | Business | Married | Graduate | 1 | No | Yes | No | 59000 | 0 | 59000 | 31000 | Sedan |
| 23 | 47 | Male | Salaried | Single | Post Graduate | 2 | Yes | No | No | 87000 | 0 | 87000 | 49000 | SUV |
| 24 | 53 | Male | Salaried | Married | Graduate | 4 | Yes | No | Yes | 80000 | 50000 | 130000 | 49000 | SUV |
| 25 | 30 | Male | Business | Married | Graduate | 2 | No | No | Yes | 42000 | 26000 | 68000 | 18000 | Hatchback |
| 26 | 22 | Male | Salaried | Married | Graduate | 2 | No | No | Yes | 39000 | 35000 | 74000 | 18000 | Hatchback |
| 27 | 28 | Female | Salaried | Married | Post Graduate | 2 | Yes | No | No | 60000 | 0 | 60000 | 31000 | Sedan |
| 28 | 58 | Male | Salaried | Married | Post Graduate | 3 | No | No | Yes | 87000 | 26000 | 113000 | 50000 | SUV |
| 29 | 42 | Female | Business | Married | Graduate | 2 | No | No | Yes | 54000 | 35000 | 89000 | 31000 | Sedan |
| 30 | 24 | Male | Business | Married | Post Graduate | 3 | No | Yes | Yes | 57000 | 40000 | 97000 | 18000 | Hatchback |
| 31 | 28 | Female | Salaried | Married | Graduate | 1 | Yes | No | Yes | 36000 | 27000 | 63000 | 31000 | Sedan |
| 32 | 26 | Male | Business | Married | Post Graduate | 3 | No | No | Yes | 69000 | 40000 | 109000 | 18000 | Hatchback |
| 33 | 44 | Male | Salaried | Single | Post Graduate | 0 | Yes | No | No | 68000 | 0 | 68000 | 50000 | SUV |
| 34 | 24 | Male | Salaried | Married | Post Graduate | 3 | Yes | No | Yes | 60000 | 40000 | 100000 | 18000 | Hatchback |
| 35 | 58 | Male | Business | Married | Post Graduate | 3 | Yes | No | Yes | 78000 | 40000 | 118000 | 50000 | SUV |
| 36 | 26 | Male | Business | Married | Graduate | 3 | No | No | No | 51000 | 0 | 51000 | 18000 | Hatchback |
| 37 | 27 | Male | Salaried | Married | Graduate | 2 | No | No | Yes | 49000 | 26000 | 75000 | 18000 | Hatchback |
| 38 | 25 | Male | Salaried | Married | Post Graduate | 2 | No | Yes | Yes | 54000 | 50000 | 104000 | 18000 | Hatchback |
| 39 | 27 | Male | Salaried | Married | Post Graduate | 3 | Yes | No | Yes | 71000 | 25000 | 96000 | 18000 | Hatchback |
| 40 | 29 | Female | Salaried | Single | Graduate | 2 | No | Yes | No | 37000 | 0 | 37000 | 31000 | Sedan |
| 41 | 27 | Male | Business | Married | Post Graduate | 3 | No | No | Yes | 63000 | 40000 | 103000 | 18000 | Hatchback |
| 42 | 29 | Male | Salaried | Married | Graduate | 3 | No | No | No | 57000 | 0 | 57000 | 18000 | Hatchback |
| 43 | 28 | Male | Business | Married | Graduate | 2 | Yes | No | Yes | 43000 | 25000 | 68000 | 18000 | Hatchback |
| 44 | 56 | Male | Salaried | Married | Graduate | 3 | Yes | No | Yes | 72000 | 50000 | 122000 | 50000 | SUV |
| 45 | 28 | Male | Salaried | Married | Post Graduate | 3 | Yes | Yes | Yes | 60000 | 25000 | 85000 | 18000 | Hatchback |
| 46 | 38 | Male | Business | Married | Graduate | 2 | Yes | Yes | No | 65000 | 0 | 65000 | 31000 | Sedan |
| 47 | 42 | Male | Salaried | Married | Post Graduate | 3 | No | No | Yes | 74000 | 50000 | 124000 | 50000 | SUV |
| 48 | 43 | Male | Salaried | Married | Post Graduate | 2 | No | No | Yes | 75000 | 30000 | 105000 | 31000 | Sedan |
| 49 | 37 | Male | Salaried | Married | Post Graduate | 3 | Yes | No | No | 73000 | 0 | 73000 | 50000 | SUV |
data.shape
(1581, 14)
Comment
The data has 1581 rows and 14 columns.
data.dtypes.value_counts()
object    8
int64     6
dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1581 entries, 0 to 1580
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Age               1581 non-null   int64
 1   Gender            1581 non-null   object
 2   Profession        1581 non-null   object
 3   Marital_status    1581 non-null   object
 4   Education         1581 non-null   object
 5   No_of_Dependents  1581 non-null   int64
 6   Personal_loan     1581 non-null   object
 7   House_loan        1581 non-null   object
 8   Partner_working   1581 non-null   object
 9   Salary            1581 non-null   int64
 10  Partner_salary    1581 non-null   int64
 11  Total_salary      1581 non-null   int64
 12  Price             1581 non-null   int64
 13  Make              1581 non-null   object
dtypes: int64(6), object(8)
memory usage: 173.0+ KB
Observations
All columns have 1,581 non-null values each. The data contains integer (int64) and object data types.
# Convert the categorical (object) columns to the pandas category dtype
df['Gender'] = df.Gender.astype('category')
df['Profession'] = df.Profession.astype('category')
df['Marital_status'] = df.Marital_status.astype('category')
df['Education'] = df.Education.astype('category')
df['Personal_loan'] = df.Personal_loan.astype('category')
df['House_loan'] = df.House_loan.astype('category')
df['Partner_working'] = df.Partner_working.astype('category')
df['Make'] = df.Make.astype('category')
df.dtypes
Age                    int64
Gender              category
Profession          category
Marital_status      category
Education           category
No_of_Dependents       int64
Personal_loan       category
House_loan          category
Partner_working     category
Salary                 int64
Partner_salary         int64
Total_salary           int64
Price                  int64
Make                category
dtype: object
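The eight conversions above repeat the same call, so a more compact form is to pass a column-to-dtype mapping to `astype`. The following is a minimal sketch on a three-row sample copied from the `df.head()` output; only four columns are kept, and the names `sample` and `cat_cols` are illustrative rather than part of the original notebook.

```python
import pandas as pd

# Three rows copied from the df.head() output, reduced to four columns for brevity.
sample = pd.DataFrame(
    {
        "Age": [24, 28, 23],
        "Gender": ["Male", "Male", "Male"],
        "Education": ["Post Graduate", "Post Graduate", "Graduate"],
        "Make": ["Hatchback", "Hatchback", "Hatchback"],
    }
)

# Illustrative list of the columns to treat as categorical.
cat_cols = ["Gender", "Education", "Make"]

# astype accepts a {column: dtype} mapping, so one call converts every listed column.
sample = sample.astype({col: "category" for col in cat_cols})

print(sample.dtypes)
```

On the full dataframe the same call would take the eight categorical column names instead of the three used here.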
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 1581.0 | 32.211259 | 9.125477 | 22.0 | 25.0 | 29.0 | 38.0 | 60.0 |
| No_of_Dependents | 1581.0 | 2.457938 | 0.943483 | 0.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| Salary | 1581.0 | 59732.447818 | 14278.642665 | 30000.0 | 51000.0 | 59000.0 | 71000.0 | 90000.0 |
| Partner_salary | 1581.0 | 19720.430108 | 19480.214404 | 0.0 | 0.0 | 25000.0 | 38000.0 | 80000.0 |
| Total_salary | 1581.0 | 79452.877925 | 24855.936043 | 30000.0 | 61000.0 | 78000.0 | 96000.0 | 158000.0 |
| Price | 1581.0 | 35597.722960 | 13633.636545 | 18000.0 | 25000.0 | 31000.0 | 47000.0 | 70000.0 |
df.describe(exclude='number').T
| | count | unique | top | freq |
|---|---|---|---|---|
| Gender | 1581 | 2 | Male | 1252 |
| Profession | 1581 | 2 | Salaried | 896 |
| Marital_status | 1581 | 2 | Married | 1443 |
| Education | 1581 | 2 | Post Graduate | 985 |
| Personal_loan | 1581 | 2 | Yes | 792 |
| House_loan | 1581 | 2 | No | 1054 |
| Partner_working | 1581 | 2 | Yes | 868 |
| Make | 1581 | 3 | Hatchback | 884 |
cat_col = ['Gender', 'Profession', 'Marital_status', 'Education', 'Personal_loan', 'House_loan', 'Partner_working', 'Make']
for column in cat_col:
    print(df[column].value_counts())
    print('-' * 50)
Male      1252
Female     329
Name: Gender, dtype: int64
--------------------------------------------------
Salaried    896
Business    685
Name: Profession, dtype: int64
--------------------------------------------------
Married    1443
Single      138
Name: Marital_status, dtype: int64
--------------------------------------------------
Post Graduate    985
Graduate         596
Name: Education, dtype: int64
--------------------------------------------------
Yes    792
No     789
Name: Personal_loan, dtype: int64
--------------------------------------------------
No     1054
Yes     527
Name: House_loan, dtype: int64
--------------------------------------------------
Yes    868
No     713
Name: Partner_working, dtype: int64
--------------------------------------------------
Hatchback    884
Sedan        460
SUV          237
Name: Make, dtype: int64
--------------------------------------------------
Observations
1. Most buyers are married, post-graduate males without a house loan.
2. The majority of cars bought were hatchbacks (884 of 1,581).
3. The median Age, No_of_Dependents, Salary, Partner_salary, Total_salary, and Price are 29, 2, 59,000, 25,000, 78,000, and 31,000 respectively.
4. The maximum age is 60 and the minimum is 22.
5. The maximum total salary is 158,000 and the minimum is 30,000.
6. The median car price is 31,000 and the mean is about 35,600.
df.isna().sum()
Age 0 Gender 0 Profession 0 Marital_status 0 Education 0 No_of_Dependents 0 Personal_loan 0 House_loan 0 Partner_working 0 Salary 0 Partner_salary 0 Total_salary 0 Price 0 Make 0 dtype: int64
There are no missing values in the dataset.
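Besides missing values, the data dictionary implies one more check: Total_salary is defined as Salary + Partner_salary. The sketch below tests that identity on the first five rows copied from the `df.head()` output; the names `sample` and `mismatch` are illustrative, and the result says nothing about the remaining rows.

```python
import pandas as pd

# First five rows of the salary columns, copied from the df.head() output.
sample = pd.DataFrame(
    {
        "Salary": [52000, 68000, 31000, 66000, 64000],
        "Partner_salary": [25000, 0, 0, 35000, 0],
        "Total_salary": [77000, 68000, 31000, 101000, 64000],
    }
)

# Keep only the rows where the stated total differs from the sum of its parts.
mismatch = sample[sample["Salary"] + sample["Partner_salary"] != sample["Total_salary"]]

print(f"rows violating Total_salary = Salary + Partner_salary: {len(mismatch)}")
```

Running the same comparison on the full `df` would show whether the identity holds for all 1,581 rows.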
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    # Histogram: pass bins only when a value is given.
    # The palette argument is dropped because it has no effect without a hue variable.
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
histogram_boxplot(df, 'Age')
df['Age'].mode()
0    28
dtype: int64
# Which rows belong to the oldest customers (Age equal to the maximum age)?
data[df['Age'] == df['Age'].max()]
| | Age | Gender | Profession | Marital_status | Education | No_of_Dependents | Personal_loan | House_loan | Partner_working | Salary | Partner_salary | Total_salary | Price | Make |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70 | 60 | Female | Salaried | Married | Post Graduate | 3 | No | No | No | 77000 | 0 | 77000 | 50000 | SUV |
| 201 | 60 | Female | Salaried | Married | Post Graduate | 3 | Yes | No | Yes | 80000 | 50000 | 130000 | 51000 | SUV |
| 470 | 60 | Female | Salaried | Married | Post Graduate | 3 | No | No | Yes | 84000 | 60000 | 144000 | 57000 | SUV |
| 502 | 60 | Male | Salaried | Married | Post Graduate | 3 | No | No | Yes | 79000 | 60000 | 139000 | 57000 | SUV |
| 514 | 60 | Male | Salaried | Married | Post Graduate | 3 | No | No | No | 83000 | 0 | 83000 | 57000 | SUV |
| 559 | 60 | Female | Salaried | Married | Post Graduate | 3 | No | No | Yes | 79000 | 60000 | 139000 | 57000 | SUV |
| 920 | 60 | Male | Salaried | Married | Post Graduate | 4 | No | No | No | 89000 | 0 | 89000 | 61000 | SUV |
| 935 | 60 | Female | Salaried | Married | Graduate | 2 | Yes | No | Yes | 72000 | 70000 | 142000 | 61000 | SUV |
| 943 | 60 | Female | Salaried | Married | Post Graduate | 4 | Yes | No | Yes | 82000 | 70000 | 152000 | 61000 | SUV |
| 972 | 60 | Male | Business | Married | Post Graduate | 4 | No | No | Yes | 86000 | 70000 | 156000 | 61000 | SUV |
| 977 | 60 | Female | Salaried | Married | Graduate | 3 | Yes | No | No | 77000 | 0 | 77000 | 62000 | SUV |
| 1107 | 60 | Male | Salaried | Married | Post Graduate | 2 | No | No | No | 81000 | 0 | 81000 | 63000 | SUV |
| 1320 | 60 | Female | Business | Married | Graduate | 3 | No | No | No | 74000 | 0 | 74000 | 66000 | SUV |
| 1336 | 60 | Male | Salaried | Married | Post Graduate | 4 | Yes | No | No | 81000 | 0 | 81000 | 67000 | SUV |
| 1450 | 60 | Male | Salaried | Married | Graduate | 4 | No | No | Yes | 79000 | 40000 | 119000 | 68000 | SUV |
1. The maximum age is 60 and the minimum is 22.
2. The age distribution is skewed to the right.
3. Most buyers are in their twenties and thirties (75% are 38 or younger).
4. The median age is 29, while the mean is about 32.
5. There are outliers at the upper end of this variable.
6. All 60-year-old buyers bought SUVs, priced between 50,000 and 68,000.
histogram_boxplot(df, 'No_of_Dependents')
histogram_boxplot(df, 'Salary')
histogram_boxplot(df, 'Partner_salary')
histogram_boxplot(df, 'Total_salary')
1. Salary and Partner_salary do not have any outliers.
2. The distributions of Salary and Total_salary are close to normal. The two are also moderately correlated (0.62), as expected because Total_salary includes Salary.
3. Total_salary is the only salary variable with outliers, at the upper end.
4. Partner_salary has a large spike at zero (713 customers have a non-working partner) and a tail reaching 80,000. Its correlation with No_of_Dependents is weak (0.14).
histogram_boxplot(df, 'Price')
1. The highest car price is 70,000 and the lowest is 18,000.
2. Price is skewed to the right (a mean of about 35,600 against a median of 31,000). Its correlation with Partner_salary is weak (0.16).
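The outlier remarks above can be cross-checked against the quartiles printed by `df.describe()`. The sketch below assumes the usual box-plot convention of flagging values beyond 1.5 × IQR from the quartiles (seaborn's default `whis`); the helper name `iqr_fences` is hypothetical, and the numbers are copied from the summary table rather than recomputed from the data.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences used by a box plot with whiskers at k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (25%, 75%, min, max) for each numeric column, copied from df.describe().T
summary = {
    "Age": (25, 38, 22, 60),
    "Salary": (51000, 71000, 30000, 90000),
    "Partner_salary": (0, 38000, 0, 80000),
    "Total_salary": (61000, 96000, 30000, 158000),
    "Price": (25000, 47000, 18000, 70000),
}

has_outliers = {}
for name, (q1, q3, lowest, highest) in summary.items():
    lower, upper = iqr_fences(q1, q3)
    # A column has outliers if its minimum or maximum falls outside the fences.
    has_outliers[name] = lowest < lower or highest > upper
    print(f"{name}: fences=({lower}, {upper}), outliers={has_outliers[name]}")
```

Because only the minimum and maximum are compared with the fences, this shows whether outliers exist, not how many there are.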
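def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with a count or percentage label on top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default False)
    n: display only the top n category levels (default None, i.e. all levels)
    """
    total = len(data[feature])  # number of rows in the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each level of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its label
    plt.show()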
# Create a function that returns a pie chart for a categorical variable:
def pie_chart(x='Make'):
    """
    Function creates a pie chart for categorical variables.
    """
    fig, ax = plt.subplots(figsize=(8, 6), subplot_kw=dict(aspect="equal"))
    s = data.groupby(x).size()
    mydata_values = s.values.tolist()
    mydata_index = s.index.tolist()

    def func(pct, allvals):
        # Convert the wedge percentage back to an absolute count for the label
        absolute = int(round(pct / 100. * np.sum(allvals)))
        return "{:.1f}%\n({:d})".format(pct, absolute)

    wedges, texts, autotexts = ax.pie(mydata_values, autopct=lambda pct: func(pct, mydata_values),
                                      textprops=dict(color="w"))
    ax.legend(wedges, mydata_index,
              title="Index",
              loc="center left",
              bbox_to_anchor=(1, 0, 0.5, 1))
    plt.setp(autotexts, size=12, weight="bold")
    ax.set_title(f'{x.capitalize()} Piechart')
    plt.show()
pie_chart('Gender')
pie_chart('Profession')
1. Salaried customers are 56.7% of the sample population and Business customers 43.3%.
2. Female customers are 20.8% of the sample population (329 of 1,581).
pie_chart('Marital_status')
pie_chart('Education')
pie_chart('Make')
1. Hatchback has the highest count, with 884 (55.9%) of the sample population.
2. Sedan is 460 (29.1%).
3. SUV is 237 (15.0%).
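The percentages quoted above follow directly from the counts. As a small worked check, assuming only the three counts printed by `value_counts()` earlier (the variable names are illustrative):

```python
# Counts copied from the value_counts() output for the Make column.
make_counts = {"Hatchback": 884, "Sedan": 460, "SUV": 237}

total = sum(make_counts.values())

# Share of each car type as a percentage of all buyers, rounded to one decimal.
shares = {make: round(100 * n / total, 1) for make, n in make_counts.items()}

for make, pct in shares.items():
    print(f"{make}: {make_counts[make]} of {total} ({pct}%)")
```

On the dataframe itself, `df['Make'].value_counts(normalize=True)` returns the same shares as fractions.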
sns.countplot(x = 'Make', data = data)
<AxesSubplot:xlabel='Make', ylabel='count'>
# Dispersion of Price by type of car
sns.catplot(x='Price',
            col='Make',
            data=df,
            col_wrap=4,
            kind="violin")
plt.show()
# Dispersion of Age by type of car
sns.catplot(x='Age',
            col='Make',
            data=df,
            col_wrap=4,
            kind="violin")
plt.show()
# Dispersion of Total_salary by type of car
sns.catplot(x='Total_salary',
            col='Make',
            data=df,
            col_wrap=4,
            kind="violin")
plt.show()
# Dispersion of Salary by type of car
sns.catplot(x='Salary',
            col='Make',
            data=df,
            col_wrap=4,
            kind="violin")
plt.show()
# Check for correlation among numerical variables
num_var = ['Age', 'No_of_Dependents', 'Salary', 'Partner_salary', 'Total_salary', 'Price']
corr = df[num_var].corr()
# plot the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral", xticklabels=corr.columns, yticklabels=corr.columns)
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True) # plot the correlation coefficients as a heatmap; numeric_only (pandas >= 1.5) skips the category columns
<AxesSubplot:>
sns.pairplot(data=df[num_var], diag_kind="kde")
plt.show()
A particularly interesting relationship between Age, income, and Price (which tracks the Make of car) can be seen in this graph.
df.corr(numeric_only=True) # displays the correlation between every pair of numeric attributes as a dataframe
| | Age | No_of_Dependents | Salary | Partner_salary | Total_salary | Price |
|---|---|---|---|---|---|---|
| Age | 1.000000 | -0.157163 | 0.606094 | 0.150202 | 0.465891 | 0.790466 |
| No_of_Dependents | -0.157163 | 1.000000 | -0.043847 | 0.144508 | 0.088067 | -0.135839 |
| Salary | 0.606094 | -0.043847 | 1.000000 | 0.061943 | 0.623003 | 0.393038 |
| Partner_salary | 0.150202 | 0.144508 | 0.061943 | 1.000000 | 0.819309 | 0.156154 |
| Total_salary | 0.465891 | 0.088067 | 0.623003 | 0.819309 | 1.000000 | 0.348164 |
| Price | 0.790466 | -0.135839 | 0.393038 | 0.156154 | 0.348164 | 1.000000 |
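With 15 distinct pairs in the matrix, it can help to rank them by absolute correlation instead of scanning the table. The sketch below does this in plain Python on the coefficients above, rounded to three decimals; the variable names are illustrative.

```python
from itertools import combinations

cols = ["Age", "No_of_Dependents", "Salary", "Partner_salary", "Total_salary", "Price"]

# Correlation matrix copied from the df.corr() output, rounded to three decimals.
corr = [
    [1.000, -0.157, 0.606, 0.150, 0.466, 0.790],
    [-0.157, 1.000, -0.044, 0.145, 0.088, -0.136],
    [0.606, -0.044, 1.000, 0.062, 0.623, 0.393],
    [0.150, 0.145, 0.062, 1.000, 0.819, 0.156],
    [0.466, 0.088, 0.623, 0.819, 1.000, 0.348],
    [0.790, -0.136, 0.393, 0.156, 0.348, 1.000],
]

# One entry per unordered pair of distinct columns (upper triangle of the matrix).
pairs = [(cols[i], cols[j], corr[i][j]) for i, j in combinations(range(len(cols)), 2)]

# Strongest relationships first, regardless of sign.
pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)

for first, second, r in pairs[:4]:
    print(f"{first} vs {second}: {r:+.3f}")
```

The two salary pairs near the top are expected by construction, since Total_salary is the sum of Salary and Partner_salary. Among the rest, Age and Price form the strongest pair.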
plt.figure(figsize=(9, 9))
sns.histplot(data = df, x = 'Age', hue = 'Make')
plt.show()
plt.figure(figsize=(9, 9))
sns.histplot(data = df, x = 'Salary', hue = 'Make')
plt.show()
sns.scatterplot(data=df, x='Age', y='Price') # Plots the scatter plot using two variables (keyword arguments are required by recent seaborn versions)
<AxesSubplot:xlabel='Age', ylabel='Price'>
sns.scatterplot(data=df, x='Age', y='Salary') # Plots the scatter plot using two variables
<AxesSubplot:xlabel='Age', ylabel='Salary'>
sns.scatterplot(data=df, x='Age', y='Total_salary') # Plots the scatter plot using two variables
<AxesSubplot:xlabel='Age', ylabel='Total_salary'>
plt.figure(figsize=(15,7))
sns.boxplot(data=df, x='Make', y='Total_salary')
plt.ylabel('Total_salary')
plt.xlabel('Make')
plt.show()
plt.figure(figsize=(15,7))
sns.boxplot(data=df, x='Make', y='Salary')
plt.ylabel('Salary')
plt.xlabel('Make')
plt.show()
plt.figure(figsize=(15,7))
sns.boxplot(data=df, x='Make', y='Price')
plt.ylabel('Price')
plt.xlabel('Make')
plt.show()
pd.crosstab(df['Make'], df['Gender']).plot(kind="bar", figsize=(8, 10), stacked=True)
plt.legend()
plt.show()
pd.crosstab(df['Make'], df['Marital_status']).plot(kind="bar", figsize=(8, 10), stacked=True)
plt.legend()
plt.show()
Comment
1. There is a clear difference in the prices of each Make.
2. SUV is the most expensive car.
3. Hatchback is the least expensive.
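Since the objective is a buyer profile per car type, the box plots can be complemented by a grouped summary table. The following is a minimal sketch on ten rows (index 5 to 14) copied from the `df.head(50)` output and reduced to three columns. The names `sample` and `profile` are illustrative, and a ten-row sample only demonstrates the mechanics and is not a result for the full data.

```python
import pandas as pd

# Rows 5-14 of the df.head(50) output, reduced to Age, Price and Make.
sample = pd.DataFrame(
    {
        "Age": [36, 28, 27, 46, 24, 46, 26, 23, 44, 35],
        "Price": [49000, 18000, 18000, 49000, 18000, 49000, 18000, 18000, 31000, 31000],
        "Make": ["SUV", "Hatchback", "Hatchback", "SUV", "Hatchback",
                 "SUV", "Hatchback", "Hatchback", "Sedan", "Sedan"],
    }
)

# Median age and price of the buyers of each car type, cheapest type first.
profile = sample.groupby("Make")[["Age", "Price"]].median().sort_values("Price")

print(profile)
```

Applied to the full `df`, the same `groupby` could also include Salary and Total_salary to extend each profile.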
data.groupby(['No_of_Dependents']).agg('count')['Total_salary']
No_of_Dependents
0     20
1    229
2    557
3    557
4    218
Name: Total_salary, dtype: int64
In the dataset, approximately 84% (1332 / 1581) of customers have 2 or more dependents.
data.groupby(['Make']).agg('count')['Total_salary']
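Make
Hatchback    884
SUV          237
Sedan        460
Name: Total_salary, dtype: int64
1. SUV is the least purchased type of car (237) and Hatchback the most purchased (884).
2. This may indicate that households with two salaries prefer Hatchbacks and Sedans.
3. One possible reason is that such households own more than one car, since both partners are working. The counts above do not split buyers by Partner_working, so this remains a hypothesis.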
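Whether two-income households favor a particular car type can be examined with a cross-tabulation of Make against Partner_working. The sketch below uses ten rows (index 5 to 14) copied from the `df.head(50)` output; the names `sample` and `table` are illustrative, and the tiny sample shows the technique only, so no conclusion about the market should be drawn from it.

```python
import pandas as pd

# Rows 5-14 of the df.head(50) output, reduced to Partner_working and Make.
sample = pd.DataFrame(
    {
        "Partner_working": ["No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes"],
        "Make": ["SUV", "Hatchback", "Hatchback", "SUV", "Hatchback",
                 "SUV", "Hatchback", "Hatchback", "Sedan", "Sedan"],
    }
)

# Number of buyers for every (Make, Partner_working) combination.
table = pd.crosstab(sample["Make"], sample["Partner_working"])

print(table)
```

Passing `normalize="index"` to `pd.crosstab` turns each row into proportions, which makes car types with very different buyer counts easier to compare.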
# Let us try pandas-profiling now and see how it simplifies the EDA
# !pip install pandas-profiling==2.8.0
from pandas_profiling import ProfileReport
# Use the original dataframe, so that original features are considered
prof = ProfileReport(df)
# View the report created by pandas-profiling
prof